# Load packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

1. Model Building

We will work with data on variables related to the blood pressure of Peruvian men who moved from rural, high-altitude areas to urban, lower-altitude areas. A related study that includes similar models is:

Hirschler, V., Gonzalez, C., Molinari, C., Velez, H., Nordera, M., Suarez, R., & Robredo, A. (2019). Blood pressure level increase with altitude in three Argentinean indigenous communities. AIMS Public Health, 6(4), 370. https://doi.org/10.3934/publichealth.2019.4.370

# Upload the data from GitHub
peru_bp <- read_csv("https://raw.githubusercontent.com/laylaguyot/datasets/main//peru_bp.csv")
## Rows: 39 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (11): Age, Years, FracLife, Weight, Height, Chin, Forearm, Calf, Pulse, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View first 6 rows
head(peru_bp)
## # A tibble: 6 × 11
##     Age Years FracLife Weight Height  Chin Forearm  Calf Pulse Systol Diastol
##   <dbl> <dbl>    <dbl>  <dbl>  <dbl> <dbl>   <dbl> <dbl> <dbl>  <dbl>   <dbl>
## 1    21     1   0.0476   71     1629   8       7    12.7    88    170      76
## 2    22     6   0.273    56.5   1569   3.3     5     8      64    120      60
## 3    24     5   0.208    56     1561   3.3     1.3   4.3    68    125      75
## 4    24     1   0.0417   61     1619   3.7     3     4.3    52    148     120
## 5    25     1   0.04     65     1566   9      12.7  20.7    72    140      78
## 6    27    19   0.704    62     1639   3       3.3   5.7    72    106      72

a. Exploring data

It is always a good idea to explore our data:

library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
# A fancy scatterplot matrix
pairs.panels(peru_bp[c("Systol","Age","Years","FracLife","Weight","Height","Chin","Forearm","Calf","Pulse")], 
             method = "pearson", # correlation method
             hist.col = "#00AFBB", # color of histogram
             smooth = FALSE, density = FALSE, ellipses = FALSE)

Even though pairwise comparisons cannot show how multiple predictors jointly affect the response, we can get a sense of the relationship between each predictor and the response, as well as the relationships among the predictors.
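If the psych package is not available, base R can produce similar numeric information with a plain correlation matrix. A minimal sketch, using the built-in mtcars data as a stand-in for peru_bp:

```r
# Correlations between a response (mpg) and candidate predictors,
# using base R only; mtcars stands in for peru_bp here
vars <- mtcars[c("mpg", "wt", "hp", "disp")]

# Pairwise Pearson correlations, rounded for readability
round(cor(vars), 2)

# Base-R scatterplot matrix (less fancy than pairs.panels)
pairs(vars)
```

The first row of the correlation matrix plays the same role as the top row of pairs.panels(): the correlation of each predictor with the response.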

b. Forward selection

We start by defining two models: a null model (intercept only) and a full model containing all potential predictors.

# Null model
fit_start <- lm(Systol ~ 1, data = peru_bp)

# Full model
fit_full <- lm(Systol ~ Age + Years + FracLife + Weight + Height + Chin + Forearm + Calf + Pulse, 
               data = peru_bp)

# Forward stepwise selection
fit_step <- step(fit_start, direction = "forward", scope = formula(fit_full))
## Start:  AIC=201.71
## Systol ~ 1
## 
##            Df Sum of Sq    RSS    AIC
## + Weight    1   1775.38 4756.1 191.34
## + FracLife  1    498.06 6033.4 200.62
## + Forearm   1    484.22 6047.2 200.71
## + Calf      1    410.80 6120.6 201.18
## <none>                  6531.4 201.71
## + Height    1    313.58 6217.9 201.79
## + Chin      1    189.19 6342.2 202.57
## + Pulse     1    119.88 6411.6 202.99
## + Years     1     49.98 6481.5 203.41
## + Age       1      0.22 6531.2 203.71
## 
## Step:  AIC=191.34
## Systol ~ Weight
## 
##            Df Sum of Sq    RSS    AIC
## + FracLife  1   1314.69 3441.4 180.72
## + Years     1    972.90 3783.2 184.41
## + Age       1    385.73 4370.3 190.04
## <none>                  4756.1 191.34
## + Chin      1    143.63 4612.4 192.15
## + Calf      1     16.67 4739.4 193.20
## + Pulse     1      5.31 4750.8 193.30
## + Height    1      2.01 4754.0 193.32
## + Forearm   1      1.16 4754.9 193.33
## 
## Step:  AIC=180.72
## Systol ~ Weight + FracLife
## 
##           Df Sum of Sq    RSS    AIC
## + Chin     1   197.372 3244.0 180.42
## <none>                 3441.4 180.72
## + Years    1   113.588 3327.8 181.41
## + Age      1   100.451 3340.9 181.57
## + Forearm  1    50.548 3390.8 182.15
## + Calf     1    30.218 3411.1 182.38
## + Height   1    23.738 3417.6 182.45
## + Pulse    1     6.878 3434.5 182.64
## 
## Step:  AIC=180.42
## Systol ~ Weight + FracLife + Chin
## 
##           Df Sum of Sq    RSS    AIC
## <none>                 3244.0 180.42
## + Age      1   132.575 3111.4 180.79
## + Height   1   113.565 3130.4 181.03
## + Years    1   101.114 3142.9 181.18
## + Pulse    1    13.001 3231.0 182.26
## + Forearm  1     0.219 3243.8 182.42
## + Calf     1     0.003 3244.0 182.42
summary(fit_step)
## 
## Call:
## lm(formula = Systol ~ Weight + FracLife + Chin, data = peru_bp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.6382  -6.6316   0.4521   6.3593  24.2086 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  52.9092    15.0895   3.506 0.001266 ** 
## Weight        1.4407     0.2766   5.209 8.51e-06 ***
## FracLife    -27.3522     7.1185  -3.842 0.000491 ***
## Chin         -1.0135     0.6945  -1.459 0.153407    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.627 on 35 degrees of freedom
## Multiple R-squared:  0.5033, Adjusted R-squared:  0.4608 
## F-statistic: 11.82 on 3 and 35 DF,  p-value: 1.684e-05

At each step, R considers adding each remaining predictor (the rows marked with +) and computes the AIC (Akaike Information Criterion) of the resulting model. R then adds the variable that yields the lowest AIC. The process repeats until no addition lowers the AIC, at which point the model is finalized.

Why AIC? AIC balances fit and complexity. A lower AIC means a better trade-off between:

  • Goodness-of-fit (how well the model explains the data)

  • Complexity (models with fewer predictors are preferred)
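To see concretely what step() is computing: for a linear model it uses extractAIC(), which (up to an additive constant) equals n·log(RSS/n) + 2p, where p is the number of estimated coefficients. A minimal sketch, again using mtcars as a stand-in dataset:

```r
# Reproduce the AIC value that step() reports, using base R only
fit <- lm(mpg ~ wt, data = mtcars)

n   <- nrow(mtcars)
rss <- sum(resid(fit)^2)   # residual sum of squares
p   <- length(coef(fit))   # number of estimated coefficients

manual_aic <- n * log(rss / n) + 2 * p

# extractAIC() returns c(edf, AIC); the second element matches our formula
extractAIC(fit)[2]
manual_aic
```

This is also why the AIC values printed by step() differ from AIC(fit): the latter includes constant terms that do not affect comparisons between models fit to the same data.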

c. Best subsets

We use the regsubsets() function to compare all possible models:

library(leaps)
# Run best subsets regression
best_models <- regsubsets(Systol ~ Age + Years + FracLife + Weight + Height + Chin + Forearm + Calf + Pulse,
                          data = peru_bp,
                          nvmax = 9, nbest = 1)

# Summarize output
summary(best_models)
## Subset selection object
## Call: regsubsets.formula(Systol ~ Age + Years + FracLife + Weight + 
##     Height + Chin + Forearm + Calf + Pulse, data = peru_bp, nvmax = 9, 
##     nbest = 1)
## 9 Variables  (and intercept)
##          Forced in Forced out
## Age          FALSE      FALSE
## Years        FALSE      FALSE
## FracLife     FALSE      FALSE
## Weight       FALSE      FALSE
## Height       FALSE      FALSE
## Chin         FALSE      FALSE
## Forearm      FALSE      FALSE
## Calf         FALSE      FALSE
## Pulse        FALSE      FALSE
## 1 subsets of each size up to 9
## Selection Algorithm: exhaustive
##          Age Years FracLife Weight Height Chin Forearm Calf Pulse
## 1  ( 1 ) " " " "   " "      "*"    " "    " "  " "     " "  " "  
## 2  ( 1 ) " " " "   "*"      "*"    " "    " "  " "     " "  " "  
## 3  ( 1 ) " " " "   "*"      "*"    " "    "*"  " "     " "  " "  
## 4  ( 1 ) "*" "*"   "*"      "*"    " "    " "  " "     " "  " "  
## 5  ( 1 ) "*" "*"   "*"      "*"    " "    "*"  " "     " "  " "  
## 6  ( 1 ) "*" "*"   "*"      "*"    " "    "*"  "*"     " "  " "  
## 7  ( 1 ) "*" "*"   "*"      "*"    "*"    "*"  "*"     " "  " "  
## 8  ( 1 ) "*" "*"   "*"      "*"    "*"    "*"  "*"     " "  "*"  
## 9  ( 1 ) "*" "*"   "*"      "*"    "*"    "*"  "*"     "*"  "*"

We can see which predictors are recommended for each model size. Now we can identify the best model based on various criteria:

# Create 2x2 grid of plots
par(mfrow = c(2, 2))

# Minimize SSE
plot(summary(best_models)$rss, xlab = "Number of predictors", ylab = "SSE", type = "l")

# Maximize Adjusted R^2
plot(summary(best_models)$adjr2, xlab = "Number of predictors", ylab = "Adjusted R²", type = "l")

# Minimize Mallows' Cp
plot(summary(best_models)$cp, xlab = "Number of predictors", ylab = "Cp", type = "l")

# Minimize BIC
plot(summary(best_models)$bic, xlab = "Number of predictors", ylab = "BIC", type = "l")

How many predictors would you consider? Which ones should be included?
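As a sketch of how these criteria can be compared by hand, here is the same idea applied to three nested models for the built-in mtcars data (a stand-in for the best-subsets output above):

```r
# Compare nested models with adjusted R-squared and BIC, base R only
fits <- list(
  lm(mpg ~ wt,             data = mtcars),
  lm(mpg ~ wt + hp,        data = mtcars),
  lm(mpg ~ wt + hp + disp, data = mtcars)
)

adjr2 <- sapply(fits, function(f) summary(f)$adj.r.squared)
bic   <- sapply(fits, BIC)

which.max(adjr2)  # model size maximizing adjusted R-squared
which.min(bic)    # model size minimizing BIC
```

The same which.max()/which.min() pattern works on summary(best_models)$adjr2 and summary(best_models)$bic from regsubsets().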


Try it! Fit the model suggested by the best subset algorithm.

# Best 5 predictors
fit_5 <- lm(Systol ~ Age + Years + FracLife + Weight + Chin, 
               data = peru_bp)
summary(fit_5)
## 
## Call:
## lm(formula = Systol ~ Age + Years + FracLife + Weight + Chin, 
##     data = peru_bp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.520  -6.640  -1.093   4.893  16.366 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  109.3590    21.4843   5.090 1.41e-05 ***
## Age           -1.0120     0.3059  -3.308 0.002277 ** 
## Years          2.4067     0.7426   3.241 0.002723 ** 
## FracLife    -110.8112    27.2795  -4.062 0.000282 ***
## Weight         1.0976     0.2980   3.683 0.000819 ***
## Chin          -1.1918     0.6140  -1.941 0.060830 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.457 on 33 degrees of freedom
## Multiple R-squared:  0.6386, Adjusted R-squared:  0.5839 
## F-statistic: 11.66 on 5 and 33 DF,  p-value: 1.531e-06

Note: These methods do not account for potential issues with the model assumptions, for example. We are also not considering potential interaction effects.

d. Multicollinearity

Multicollinearity occurs when two or more predictors are highly correlated with each other. This can make it difficult to estimate the individual effect of each predictor on the response.

One way to assess multicollinearity is to look at the Variance Inflation Factor (VIF). A VIF above 5 (or 10, depending on context) suggests potential multicollinearity.
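The VIF for a predictor is 1/(1 − R²_j), where R²_j comes from regressing that predictor on all the other predictors; equivalently, the VIFs are the diagonal of the inverse correlation matrix of the predictors. A minimal base-R sketch (mtcars as a stand-in, in case the car package is not installed):

```r
# VIF by hand: regress one predictor on the others
r2_wt  <- summary(lm(wt ~ hp + disp, data = mtcars))$r.squared
vif_wt <- 1 / (1 - r2_wt)

# Equivalent shortcut: diagonal of the inverse correlation matrix
X    <- mtcars[c("wt", "hp", "disp")]
vifs <- diag(solve(cor(X)))
names(vifs) <- colnames(X)

vif_wt
vifs["wt"]   # same value
```

A VIF of 1 would mean the predictor is uncorrelated with the others; larger values mean its coefficient's variance is inflated by that factor.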

library(car)
# Calculate VIF
vif(fit_5)

# Remove Years
fit_4 <- lm(Systol ~ Age + FracLife + Weight + Chin, 
               data = peru_bp)
summary(fit_4)
vif(fit_4)

Removing Years lowered the VIFs for the remaining predictors, helping to address the multicollinearity.

2. Addressing Potential Issues

Let’s explore different datasets and address any potential issues that may come up!

a. Acoustics of woven fabric

We will work with data from the following study: Tang, X., Kong, D., & Yan, X. (2018). Multiple Regression Analysis of a Woven Fabric Sound Absorber. Textile Research Journal. https://doi.org/10.1177/0040517518758001

# Upload the data from GitHub
acoustics <- read_csv("https://raw.githubusercontent.com/laylaguyot/datasets/main//acoustics.csv")
## Rows: 24 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (11): sampleID, thickness, diameter, perforation, weight, stiffness, air...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View first 6 rows
head(acoustics)
## # A tibble: 6 × 11
##   sampleID thickness diameter perforation weight stiffness airPerm acoustic0
##      <dbl>     <dbl>    <dbl>       <dbl>  <dbl>     <dbl>   <dbl>     <dbl>
## 1        1     0.547    0.269        6.14    247     182.     805.     0.076
## 2        2     0.541    0.378        6.08    253      74.4    888.     0.092
## 3        3     0.875    0.584        4.97    366      71.4    598.     0.129
## 4        4     0.64     0.527        4.87    319     162.     799.     0.115
## 5        5     0.75     0.534        4.51    292     156.     753.     0.095
## 6        6     0.539    0.368        4.39    232     172.     704      0.073
## # ℹ 3 more variables: acoustic1 <dbl>, acoustic2 <dbl>, acoustic3 <dbl>

Try it! Fit a model to predict acoustic3 based on (trying one predictor at a time): air permeability, weight, perforation ratio, thickness. Check assumptions. Do any issues arise?

# Explore data
pairs.panels(acoustics[c("acoustic3","airPerm","weight","perforation","thickness")], 
             method = "pearson", # correlation method
             hist.col = "#00AFBB", # color of histogram
             smooth = FALSE, density = FALSE, ellipses = FALSE)

# Fit model with perforation
fit_model <- lm(acoustic3 ~ perforation, data = acoustics)
summary(fit_model)
## 
## Call:
## lm(formula = acoustic3 ~ perforation, data = acoustics)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.059594 -0.033618  0.004946  0.027706  0.062671 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.512424   0.022057  23.232  < 2e-16 ***
## perforation -0.042188   0.006112  -6.903 6.24e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03803 on 22 degrees of freedom
## Multiple R-squared:  0.6841, Adjusted R-squared:  0.6698 
## F-statistic: 47.65 on 1 and 22 DF,  p-value: 6.238e-07
plot(fit_model)

# Try a transformation
acoustics <- acoustics |>
  mutate(perforation_inv = 1/perforation)

# Fit model
fit_model <- lm(acoustic3 ~ perforation_inv, data = acoustics)
summary(fit_model)
## 
## Call:
## lm(formula = acoustic3 ~ perforation_inv, data = acoustics)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.104264 -0.021164 -0.002658  0.029929  0.081648 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      0.25844    0.02209  11.697 6.49e-11 ***
## perforation_inv  0.32165    0.05829   5.518 1.52e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04382 on 22 degrees of freedom
## Multiple R-squared:  0.5805, Adjusted R-squared:  0.5615 
## F-statistic: 30.45 on 1 and 22 DF,  p-value: 1.519e-05
plot(fit_model)


b. Roasted hazelnuts

We will work with data from the following study: Şimşek, A. (2007). The use of 3D-nonlinear regression analysis in mathematics modeling of colour change in roasted hazelnuts. Journal of Food Engineering, 78(4), 1361-1370. https://doi.org/10.1016/j.jfoodeng.2006.01.008

# Upload the data from GitHub
hazelnuts <- read_csv("https://raw.githubusercontent.com/laylaguyot/datasets/main//hazelnuts.csv")
## Rows: 150 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): type, measure
## dbl (4): experiment, temperature, time, color_change
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View first 6 rows
head(hazelnuts)
## # A tibble: 6 × 6
##   experiment temperature  time color_change type  measure
##        <dbl>       <dbl> <dbl>        <dbl> <chr> <chr>  
## 1          1         125    15         81.0 fosa  whole  
## 2          2         125    25         76.8 fosa  whole  
## 3          3         125    35         76.3 fosa  whole  
## 4          4         125    45         74.6 fosa  whole  
## 5          5         125    55         73.6 fosa  whole  
## 6          6         135    13         79.3 fosa  whole

Try it! Fit a model to predict color_change based on temperature and time. Check assumptions. Do any issues arise?

# Explore data
pairs.panels(hazelnuts[c("color_change", "temperature","time")], 
             method = "pearson", # correlation method
             hist.col = "#00AFBB", # color of histogram
             smooth = FALSE, density = FALSE, ellipses = FALSE)

# Fit model
fit_model <- lm(color_change ~ temperature + time, data = hazelnuts)

# Summary
summary(fit_model)
## 
## Call:
## lm(formula = color_change ~ temperature + time, data = hazelnuts)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7564 -2.3274  0.2477  2.3283  6.0409 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 100.52162    2.85072  35.262  < 2e-16 ***
## temperature  -0.14738    0.01772  -8.316 5.64e-14 ***
## time         -0.24155    0.01995 -12.105  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.806 on 147 degrees of freedom
## Multiple R-squared:  0.5219, Adjusted R-squared:  0.5154 
## F-statistic: 80.22 on 2 and 147 DF,  p-value: < 2.2e-16
# Check assumptions
plot(fit_model)


Let’s consider a polynomial regression:

# Center the predictors and Squared predictors
hazelnuts <- hazelnuts |>
  mutate(temp_c = temperature - mean(temperature), 
         time_c = time - mean(time),
         temp2 = temp_c^2,
         time2 = time_c^2,
         temp_c.time_c = temp_c*time_c)

# Fit the polynomial regression model
fit_model <- lm(color_change ~ temp_c + temp2 + time_c + time2 + temp_c.time_c, 
                data = hazelnuts)

# Display the summary table for the regression model 
summary(fit_model)
## 
## Call:
## lm(formula = color_change ~ temp_c + temp2 + time_c + time2 + 
##     temp_c.time_c, data = hazelnuts)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0395 -2.1019  0.0647  2.1811  4.9496 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   72.2383565  0.3949027 182.927  < 2e-16 ***
## temp_c        -0.1693516  0.0171207  -9.892  < 2e-16 ***
## temp2         -0.0018928  0.0013632  -1.389    0.167    
## time_c        -0.2885604  0.0194290 -14.852  < 2e-16 ***
## time2         -0.0007211  0.0016982  -0.425    0.672    
## temp_c.time_c -0.0091940  0.0018166  -5.061 1.25e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.502 on 144 degrees of freedom
## Multiple R-squared:  0.6277, Adjusted R-squared:  0.6148 
## F-statistic: 48.56 on 5 and 144 DF,  p-value: < 2.2e-16
# Check assumptions
plot(fit_model)
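Centering the predictors before squaring them is not just cosmetic: it sharply reduces the correlation between a predictor and its square, which would otherwise introduce multicollinearity into the polynomial model. A small sketch with a synthetic, evenly spaced predictor:

```r
# A positive-valued predictor is strongly correlated with its square...
x <- 1:100
cor(x, x^2)       # strongly positive

# ...but after centering, that correlation (nearly) vanishes
x_c <- x - mean(x)
cor(x_c, x_c^2)   # essentially 0 for a symmetric predictor
```

This is why the hazelnuts chunk above builds temp_c and time_c before forming the squared and interaction terms.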

c. Cognition in older adults

We will work with data from the following study: Sherwood, J., Inouye, C., Webb, S., Zhou, A., Anderson, E., & Spink, N. (2019). Relationship between physical and cognitive performance in community dwelling, ethnically diverse older adults: a cross-sectional study. PeerJ. https://doi.org/10.7717/peerj.6159

# Upload the data from GitHub
cognition <- read_csv("https://raw.githubusercontent.com/laylaguyot/datasets/main//cognition.csv")
## Rows: 87 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Gender, Education
## dbl (7): ID, Age, Heart_Rate, MMS, Animal_Naming, TMTA, TMTB
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View first 6 rows
head(cognition)
## # A tibble: 6 × 9
##      ID Gender   Age Education     Heart_Rate   MMS Animal_Naming  TMTA  TMTB
##   <dbl> <chr>  <dbl> <chr>              <dbl> <dbl>         <dbl> <dbl> <dbl>
## 1     1 M         91 Graduate             109    96            20  34.8  154.
## 2     2 F         83 Undergraduate        112    98            18  45    157 
## 3     3 F         91 Undergraduate         99    78            10  30.5  122.
## 4     4 F         76 Graduate             105    94            16  49     58 
## 5     5 F         63 Highschool            88    94            21  26.6   77 
## 6     6 F         75 Undergraduate        109    94            19  40.9  128

Try it! Fit a model to predict TMTB based on Heart_Rate and Education. Check assumptions. Do any issues arise?

# Explore data
pairs.panels(cognition[c("TMTB", "Heart_Rate","Education")], 
             method = "pearson", # correlation method
             hist.col = "#00AFBB", # color of histogram
             smooth = FALSE, density = FALSE, ellipses = FALSE)

# Fit model
fit_model <- lm(TMTB ~ Heart_Rate + Education, data = cognition)
summary(fit_model)
## 
## Call:
## lm(formula = TMTB ~ Heart_Rate + Education, data = cognition)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -80.651 -29.717  -8.452  17.209 169.778 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            151.0546    34.7086   4.352 3.82e-05 ***
## Heart_Rate              -0.6003     0.2761  -2.174 0.032546 *  
## EducationHighschool     52.0220    13.4825   3.858 0.000225 ***
## EducationUndergraduate  11.7843    12.6032   0.935 0.352488    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 47.36 on 83 degrees of freedom
## Multiple R-squared:  0.2721, Adjusted R-squared:  0.2458 
## F-statistic: 10.34 on 3 and 83 DF,  p-value: 7.43e-06
plot(fit_model)

# Try two transformations
cognition <- cognition |>
  mutate(TMTB_log = log10(TMTB),
         TMTB_sqrt = sqrt(TMTB))

# Fit model with log
fit_model <- lm(TMTB_log ~ Heart_Rate + Education, data = cognition)
summary(fit_model)
## 
## Call:
## lm(formula = TMTB_log ~ Heart_Rate + Education, data = cognition)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.37289 -0.14213 -0.00944  0.11190  0.52441 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             2.210853   0.133966  16.503  < 2e-16 ***
## Heart_Rate             -0.002939   0.001066  -2.758 0.007150 ** 
## EducationHighschool     0.201953   0.052039   3.881 0.000208 ***
## EducationUndergraduate  0.049082   0.048645   1.009 0.315922    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1828 on 83 degrees of freedom
## Multiple R-squared:  0.3066, Adjusted R-squared:  0.2815 
## F-statistic: 12.23 on 3 and 83 DF,  p-value: 1.051e-06
plot(fit_model)

# Fit model with sqrt
fit_model <- lm(TMTB_sqrt ~ Heart_Rate + Education, data = cognition)
summary(fit_model)
## 
## Call:
## lm(formula = TMTB_sqrt ~ Heart_Rate + Education, data = cognition)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.7453 -1.4760 -0.1839  1.0811  7.0060 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            12.44816    1.56914   7.933 8.84e-12 ***
## Heart_Rate             -0.03140    0.01248  -2.515 0.013815 *  
## EducationHighschool     2.39838    0.60953   3.935 0.000172 ***
## EducationUndergraduate  0.57306    0.56978   1.006 0.317452    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.141 on 83 degrees of freedom
## Multiple R-squared:  0.2964, Adjusted R-squared:  0.271 
## F-statistic: 11.65 on 3 and 83 DF,  p-value: 1.893e-06
plot(fit_model)

The log transformation addressed the issues with the assumptions best.
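Why does the log help here? Right-skewed responses such as completion times become much more symmetric on a log scale. A quick synthetic illustration (simulated data, not the cognition dataset):

```r
set.seed(1)

# Simulate a right-skewed (log-normal) response
y <- exp(rnorm(200, mean = 4, sd = 0.5))

# Simple moment-based skewness, base R only
skewness <- function(v) mean(((v - mean(v)) / sd(v))^3)

skewness(y)         # clearly positive: right-skewed
skewness(log10(y))  # near zero after the log transform
```

The square-root transform pulls in the right tail less aggressively than the log, which matches what we saw in the residual plots above.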


d. Gaming and gambling

We will work with data from the following study: Zendle, D. (2020). Beyond loot boxes: a variety of gambling-like practices in video games are linked to both problem gambling and disordered gaming. PeerJ. https://doi.org/10.7717/peerj.9466

# Upload the data from GitHub
gaming <- read_csv("https://raw.githubusercontent.com/laylaguyot/datasets/main//gaming.csv")
## Rows: 1108 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): mainGame
## dbl (12): age, gender, ethnicity, gameplayFreq, gambling, esports, token, si...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# View first 6 rows
head(gaming)
## # A tibble: 6 × 13
##     age gender ethnicity gameplayFreq gambling esports token simulated
##   <dbl>  <dbl>     <dbl>        <dbl>    <dbl>   <dbl> <dbl>     <dbl>
## 1    38      0         1            1       11       0     0         0
## 2    50      0         0            1        6       0     0         0
## 3    32      0         0            1        6       0     0         0
## 4    38      0         0            1        9       0     0         1
## 5    60      0         0            1        0       0     0         0
## 6    58      0         0            1        3       0     0         0
## # ℹ 5 more variables: real_money <dbl>, lootbox <dbl>, problem_gambling <dbl>,
## #   gaming_disorder <dbl>, mainGame <chr>

Try it! Which variable(s) should be considered as the outcome? Fit a model to predict this outcome based on some predictors.

Some reasonable outcomes to model would be the indicators of problem_gambling or gaming_disorder, which calls for a logistic regression model rather than linear regression.
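As a sketch of what that would look like (with simulated data, since the exact coding of the outcome variables is not shown here), a hypothetical binary outcome can be modeled with glm():

```r
set.seed(42)

# Simulate a binary outcome driven by one predictor, as a stand-in
# for e.g. a problem_gambling indicator predicted by loot-box engagement
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-1 + 2 * x))

# Logistic regression: glm() with a binomial family
fit_logit <- glm(y ~ x, family = binomial)
summary(fit_logit)

# Coefficients are on the log-odds scale; exponentiate for odds ratios
exp(coef(fit_logit))
```

With the gaming data, y and x would be replaced by the chosen outcome indicator and predictors such as lootbox or gameplayFreq.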